Eric Houtman, Rick Lattin, Kellen Long
Kaggle Dataset: https://www.kaggle.com/datasets/andrewmvd/animal-faces?resource=download
Professor Larson's Github: https://github.com/eclarson/MachineLearningNotebooks
Extension Research: https://www.youtube.com/watch?v=yn1NUwaxhZg&t=8s
Our data contains pictures of dogs and cats and could be used to classify images of animals as one species or the other. The dataset was collected for exactly this purpose: training models to identify images of dogs and cats and correctly label their species.
Intuitively, the prediction task is to classify each animal as either a dog or a cat, which could be useful to third parties such as vets, animal shelters, or pet care businesses. Such a classifier could automate inventory for any of these industries, determining the number and distribution of cats and dogs in a given facility. It could also help distinguish cats from dogs when distributing species-specific supplies, such as medicines formulated only for cats or only for dogs.
As for the accuracy of the results, use cases involving medicine would demand an extremely high level of accuracy, while most other classification tasks could tolerate a broader margin of error.
#load the images from the folder as numpy arrays
import numpy as np
import matplotlib.pyplot as plt
#import os and cv2
import os
from os import listdir
import cv2
#set the path to the folder with the images
cat_folder = "/Users/erichoutman/Desktop/SMU/Spring 2023/CS 5324-ML/Lab 2/train/cat"
dog_folder = "/Users/erichoutman/Desktop/SMU/Spring 2023/CS 5324-ML/Lab 2/train/dog"
We imported all of the necessary libraries and set the paths to the folders containing the cat and dog images, respectively.
#read in the first 1500 images of the cat folder
cat_images = [cv2.imread(cat_folder + "/" + file) for file in listdir(cat_folder) if file.endswith(".jpg")][:1500]
#read in the first 1500 images of the dog folder
dog_images = [cv2.imread(dog_folder + "/" + file) for file in listdir(dog_folder) if file.endswith(".jpg")][:1500]
#give all the cat images a label of 0
cat_labels = [0 for _ in range(len(cat_images))]
#give all the dog images a label of 1
dog_labels = [1 for _ in range(len(dog_images))]
#combine the cat and dog images and labels into one list
images = cat_images + dog_images
labels = cat_labels + dog_labels
#resize the images
images = [cv2.resize(image, (244, 244)) for image in images]
#create a copy of the images converted to grayscale
images_gray = [cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in images]
In the cell above, we read in and labeled all of the cat and dog images, then resized each image so that the two groups are easy to work with and every image has the same dimensions. We also converted a copy of each image to grayscale.
#read these images in as numpy arrays
images_array = np.array(images_gray)
#convert the labels to a numpy array
labels_array = np.array(labels)
#display the first 5 cat images using a subplot
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(images_gray[i], cmap="gray")
    plt.title(labels[i])
    plt.axis("off")
plt.show()
#display the first 5 dog images using a subplot
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(images_gray[i+1500], cmap="gray")
    plt.title(labels[i+1500])
    plt.axis("off")
plt.show()
We converted the lists of images and labels into numpy arrays and printed the first five cat images and the first five dog images.
#linearize the images to create a table of 1-D image features
images_linear = images_array.reshape(images_array.shape[0], -1)
#display the shape of the linearized images
print(images_linear.shape)
#display the first 5 rows of the linearized images
print(images_linear[:5])
#visualize the first 5 images in the linearized images
plt.figure(figsize=(20,10))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(images_linear[i].reshape(244,244), cmap="gray")
    plt.title(labels[i])
    plt.axis("off")
plt.show()
(3000, 59536)
[[113 104  93 ... 194 231 245]
 [  3   3   3 ... 142 141 137]
 [122 114  95 ... 146 143 142]
 [107 107 107 ... 115 111 112]
 [245 247 248 ... 181 183 181]]
We then reshape each image into a single row of integers representing its pixel values, so that each pixel is treated as an individual feature.
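The effect of this flattening can be checked on a tiny example: `reshape(n, -1)` lays each image's pixels out row by row, and the inverse reshape recovers the image exactly. A minimal numpy sketch with synthetic 2x2 "images" (not our dataset):

```python
import numpy as np

# two tiny 2x2 "images" stacked into a (2, 2, 2) array
stack = np.array([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]])

# flatten each image into one row of features, as done with images_array
flat = stack.reshape(stack.shape[0], -1)
print(flat.shape)   # (2, 4)
print(flat[0])      # [1 2 3 4] -- row-major pixel order is preserved

# the inverse reshape recovers the original image exactly
restored = flat[0].reshape(2, 2)
```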
#x is the original grayscale images
#y is the labels
#names is the names of the classes
X = images_linear
y = labels_array
names = ["cat", "dog"]
n_samples, n_features = X.shape
_, h, w = images_array.shape
n_classes = len(names)
print(np.sum(~np.isfinite(X)))
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("Original Image Sizes {} by {}".format(h,w))
0
n_samples: 3000
n_features: 59536
n_classes: 2
Original Image Sizes 244 by 244
Then, we print summary statistics for the dataset (the number of samples, features, and classes, and the image dimensions) to get a sense of the scope of our data; the leading 0 confirms there are no non-finite pixel values.
# helper plotting function
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
#shuffle the images and labels
from sklearn.utils import shuffle
images_linear, labels_array = shuffle(images_linear, labels_array, random_state=0)
#plot the first 18 images
plot_gallery(images_linear, labels_array, h, w)
Now, with our dataset created, we shuffle the cats and dogs in the numpy arrays in which they are stored, then print the first 18 cats and dogs in the set (in grayscale) to verify that the shuffle worked correctly.
from sklearn.decomposition import PCA
n_components = 244
h, w = images_array[0].shape
print("Extracting the top %d eigenfaces from %d faces" % (n_components, X.shape[0]))
pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 244 eigenfaces from 3000 faces
CPU times: user 1min 51s, sys: 20.3 s, total: 2min 12s
Wall time: 20.9 s
We conducted PCA on the dataset, instructing the PCA function to reduce each image to 244 components. The target of 244 was chosen because each original image had 59,536 features, and 244 is the square root of that number. Using the square root of the original feature count lets PCA reduce the dimensions of each image quickly and effectively while retaining many of the features of the original image. We expected PCA to behave this way because each image is a perfect square, 244 pixels on each side. We verified this deduction by testing many different values with the PCA function, and the square root of the feature count proved the most effective choice for this form of dimensionality reduction.
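As a complement to the square-root heuristic, scikit-learn can also choose the component count from a variance target: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on synthetic data (a random matrix standing in for our image matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic stand-in for the image matrix: 200 samples, 50 features
X_demo = rng.normal(size=(200, 50))

# keep the smallest number of components explaining >= 90% of variance
pca_frac = PCA(n_components=0.90)
pca_frac.fit(X_demo)

kept = pca_frac.n_components_
retained = pca_frac.explained_variance_ratio_.sum()
print(kept, retained)
```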
# this is a scree plot
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
                 ],
        "layout": Layout(xaxis=XAxis(title='Principal components'),
                         yaxis=YAxis(title='Explained variance ratio'))
    })
plot_explained_variance(pca)
Afterwards, we made a scree plot to represent the results of our PCA calculations in order to visualize the data.
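If plotly is unavailable or the offline plot does not render, the same scree plot can be drawn with matplotlib alone. A sketch mirroring plot_explained_variance, on synthetic stand-in data (the Agg backend line is only needed outside a notebook):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# synthetic stand-in data so the sketch is self-contained
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 30))
demo_pca = PCA(n_components=10).fit(demo)

explained_var = demo_pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)

plt.bar(range(len(explained_var)), explained_var,
        label="individual explained variance")
plt.step(range(len(cum_var_exp)), cum_var_exp, where="mid",
         label="cumulative explained variance")
plt.xlabel("Principal components")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.savefig("scree.png")
```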
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
We displayed the first 18 eigenfaces produced by PCA to visually verify that it had captured the most significant features of the images.
def reconstruct_image(trans_obj, org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
idx_to_reconstruct = 50
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA')
plt.grid(False)
We reconstruct an individual image from its PCA representation to see how well the reduced data reflects the original image.
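The visual comparison can also be backed with a number. A minimal sketch computing the mean squared reconstruction error, on synthetic rank-5 data where 5 components should reconstruct almost perfectly (the same transform/inverse_transform pattern as reconstruct_image above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data with true rank 5, so 5 components capture all its structure
low_rank = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 40))

pca_demo = PCA(n_components=5).fit(low_rank)
recon = pca_demo.inverse_transform(pca_demo.transform(low_rank))

# mean squared error between original and reconstructed values
mse = np.mean((low_rank - recon) ** 2)
print(mse)
```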
n_components = 244
print("Extracting the top %d eigenfaces from %d faces" % (n_components, X.shape[0]))
rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
Extracting the top 244 eigenfaces from 3000 faces
CPU times: user 1min 49s, sys: 19.1 s, total: 2min 8s
Wall time: 18.8 s
PCA(n_components=244, svd_solver='randomized')
We conducted randomized PCA on the dataset, instructing the randomized PCA function to reduce each image to 244 components. As with full PCA, the target of 244 was selected because each original image had 59,536 features and 244 is the square root of that number. Randomized PCA performs essentially the same decomposition as PCA; it uses a randomized approximation when computing that decomposition, which saves run time. We tested 244 target components against many other options and, just as for PCA, 244 proved the fastest and most effective number of dimensions.
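The closeness of the two solvers can be checked directly: on small synthetic data, the randomized solver's leading explained variance is typically within a small tolerance of the exact solver's. A hedged sketch (synthetic data, not our images):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
demo = rng.normal(size=(300, 100))

# exact SVD vs randomized SVD, same component count
full_pca = PCA(n_components=20, svd_solver="full").fit(demo)
rand_pca = PCA(n_components=20, svd_solver="randomized",
               random_state=0).fit(demo)

full_var = full_pca.explained_variance_ratio_.sum()
rand_var = rand_pca.explained_variance_ratio_.sum()
print(full_var, rand_var)  # the two sums should be very close
```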
eigenfaces = rpca.components_.reshape((n_components, h, w))
plot_explained_variance(rpca)
Afterwards, we made a scree plot to represent the results of our randomized PCA calculations in order to visualize the data.
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
Afterwards, we displayed 18 images that have been through randomized PCA to visually verify that the RPCA had effectively reduced each image to only the most significant features.
In this section, we reconstruct each image based on its PCA and randomized PCA components. We then compare the reconstructed images with the original images using the widget below in order to see the effectiveness of the two methods.
import ipywidgets as widgets
import warnings
# warnings.simplefilter('ignore', DeprecationWarning)
# warnings.simplefilter("always",DeprecationWarning)
def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = int(np.round(idx_to_reconstruct))  # array indices must be ints
    x_flat = images_linear[idx_to_reconstruct].reshape(1, -1)
    reconstructed_image = pca.inverse_transform(pca.transform(x_flat.copy()))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(x_flat.copy()))
    plt.figure(figsize=(15,7))
    plt.subplot(1,3,1) # original
    plt.imshow(x_flat.reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original")
    plt.grid(False)
    plt.subplot(1,3,2) # pca
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"Full PCA, {n_components} elements")
    plt.grid(False)
    plt.subplot(1,3,3) # randomized pca
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"Randomized PCA, {n_components} elements")
    plt.grid(False)
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,images_linear.shape[0]-1,1))
In order to further explore our data and to quantitatively show the difference between PCA and Randomized PCA, we began by analyzing the explained variance of each algorithm on the same dataset.
#show the explained variance of both PCA and randomized PCA
plot_explained_variance(pca)
plot_explained_variance(rpca)
#calculate and display the explained variance of PCA and randomized PCA
print("PCA explained variance: {}".format(np.sum(pca.explained_variance_ratio_)))
print("Randomized PCA explained variance: {}".format(np.sum(rpca.explained_variance_ratio_)))
PCA explained variance: 0.9024461919932214
Randomized PCA explained variance: 0.902469958774991
During our analysis of each method, we found the explained variance to be almost identical, with Randomized PCA having a marginally higher explained variance.
Moving forward with this, we decided to time each method as well, in order to further measure the differences between the two.
print("PCA time: ")
%time pca.fit(X.copy())
print(" ")
print("Randomized PCA time: ")
%time rpca.fit(X.copy())
PCA time:
CPU times: user 1min 47s, sys: 17.7 s, total: 2min 5s
Wall time: 20.7 s

Randomized PCA time:
CPU times: user 1min 48s, sys: 17.2 s, total: 2min 5s
Wall time: 17.8 s
PCA(n_components=244, svd_solver='randomized')
After timing each method, we found that Randomized PCA had a shorter wall time than full PCA (17.8 s versus 20.7 s), which would be beneficial when representing images with fewer components, as it also provided a marginal increase in explained variance.
In the end, for our prediction task we preferred Randomized PCA for dimensionality reduction, as it was faster than full PCA while providing a marginal increase in explained variance.
from skimage.feature import daisy
from matplotlib.pyplot import imshow
# select one grayscale image to visualize the descriptor layout
img = X[5].reshape((h, w))
features, img_desc = daisy(img,
                           step=30,         # distance between centers of the descriptor regions
                           radius=30,       # radius of the descriptor regions
                           rings=2,         # number of rings
                           histograms=8,    # number of histograms
                           orientations=8,  # number of orientations
                           visualize=True)
imshow(img_desc)
plt.grid(False)
As seen in the diagram above, we performed DAISY feature extraction with the parameters shown, as we believed these settings covered our dataset's fine-grained features well while leaving room for coarser features like the overall shape of the animal.
# create a function to take in the row of the matrix and return a new feature
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=30, radius=30, rings=2,
                 histograms=8, orientations=8, visualize=False)
    return feat.reshape((-1))
# apply the function to each row of the matrix
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
CPU times: user 4min 4s, sys: 1.59 s, total: 4min 5s
Wall time: 4min 5s
(3000, 6664)
This defines a function applying the DAISY feature extraction algorithm to a single image, then times the function as it is applied to every image in the dataset.
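The (3000, 6664) shape can be sanity-checked by hand from the DAISY parameters: descriptor centres form a grid with the given step inside a border of radius pixels, and each descriptor concatenates (rings x histograms + 1) pooled histograms with orientations bins each. A quick arithmetic sketch (our reading of skimage's layout, worth verifying against the docs):

```python
# DAISY output size for our parameters on a 244x244 image
h = w = 244
step, radius, rings, histograms, orientations = 30, 30, 2, 8, 8

# descriptor centres form a grid avoiding a border of `radius` pixels
locations_per_axis = (h - 2 * radius) // step + 1          # 7
n_locations = locations_per_axis ** 2                      # 49

# each descriptor: one centre histogram plus rings*histograms ring
# histograms, each with `orientations` bins
descriptor_len = (rings * histograms + 1) * orientations   # 136

print(n_locations * descriptor_len)  # 6664, matching daisy_features.shape[1]
```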
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
CPU times: user 3.69 s, sys: 565 ms, total: 4.25 s
Wall time: 517 ms
We then created a distance matrix to be used when performing nearest neighbor classification in the next cell.
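A distance matrix like this also supports a quantitative check: leave-one-out 1-nearest-neighbour accuracy, predicting each image's label from its single closest other image. A sketch on tiny synthetic clusters (the helper would apply unchanged to dist_matrix and labels_array):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def one_nn_accuracy(dist, labels):
    """Leave-one-out 1-NN accuracy from a precomputed distance matrix."""
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)          # never pick the image itself
    nearest = np.argmin(d, axis=1)
    return np.mean(labels[nearest] == labels)

# two well-separated synthetic clusters standing in for cats and dogs
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, size=(20, 5)),
                   rng.normal(5, 0.1, size=(20, 5))])
labs = np.array([0] * 20 + [1] * 20)

acc = one_nn_accuracy(pairwise_distances(feats), labs)
print(acc)  # 1.0 for these well-separated clusters
```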
#import imshow from matplotlib
from matplotlib.pyplot import imshow
import copy
# find closest image to current image
idx1 = 5
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.inf # don't pick the same image!
idx2 = np.argmin(distances)
plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
plt.imshow(X[idx1].reshape((h,w)), cmap='gray')
plt.title("Original Image")
plt.grid()
plt.subplot(1,2,2)
plt.imshow(X[idx2].reshape((h,w)), cmap='gray')
plt.title("Closest Image")
plt.grid()
import copy
from sklearn.metrics.pairwise import pairwise_distances
def closest_image(dmat_daisy, idx1):
    distances = copy.deepcopy(dmat_daisy[idx1,:])
    distances[idx1] = np.inf
    idx2 = np.argmin(distances)
    plt.figure(figsize=(15,10))
    plt.subplot(2,2,1)
    plt.imshow(X[idx1].reshape((h,w)), cmap='gray')
    plt.title("Original Image")
    plt.grid()
    plt.subplot(2,2,2)
    plt.imshow(X[idx2].reshape((h,w)), cmap='gray')
    plt.title("Closest Image (Daisy)")
    plt.grid()
widgets.interact(closest_image, idx1=(0, n_samples-1, 1), dmat_daisy=widgets.fixed(dist_matrix), __manual=True)
#show the first 5 images with their closest daisy image
for i in range(5):
    closest_image(dist_matrix, i)
for i in range(5):
    closest_image(dist_matrix, i+1500)
After our analysis of DAISY feature extraction, we believe this method would be useful for our prediction task, as the nearest-neighbour results were very accurate. This accuracy did come with a drawback, however: DAISY extraction with our dataset and parameters took much longer to complete than the other methods we tested. Even without 100% accuracy, we believe this method could still serve our prediction task well, though anyone using it would need to allow considerably more time for feature extraction.
import numpy as np
import cv2
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
ksize = 10
sigma = 5
theta = 1*np.pi/2
lamda = 1*np.pi/4
gamma= 1
phi = 0.6
kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lamda, gamma, phi, ktype=cv2.CV_32F)
plt.imshow(kernel)
#grab the first image
img = X[5].reshape((h,w))
imshow(img, cmap='gray')
fimg = cv2.filter2D(img, cv2.CV_8UC3, kernel)
kernel_resized = cv2.resize(kernel, (400, 400)) # Resize image
plt.imshow(kernel_resized)
plt.imshow(fimg, cmap='gray')
We tested different parameter values for the Gabor kernel feature extraction method and found that the values seen above provided the best results for our dataset.
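For intuition about what these parameters control, the kernel that cv2.getGaborKernel builds can be written out from the standard Gabor formula: a Gaussian envelope (sigma, gamma) modulating a cosine wave (lamda, phi) along a direction rotated by theta. A numpy-only sketch for illustration, not a byte-for-byte reproduction of OpenCV's kernel (OpenCV handles even kernel sizes slightly differently):

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lamda, gamma, phi):
    """Standard real Gabor kernel: Gaussian envelope times a cosine wave."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / lamda + phi)
    return envelope * carrier

# same parameter values as the cv2.getGaborKernel call above
k = gabor_kernel(ksize=10, sigma=5, theta=np.pi / 2, lamda=np.pi / 4,
                 gamma=1, phi=0.6)
print(k.shape)  # (11, 11): a centred grid from -5 to 5
```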
# apply gabor filter to all images
def apply_gabor(row, shape):
    feat = cv2.filter2D(row.reshape(shape), cv2.CV_8UC3, kernel)
    return feat.reshape((-1))
# apply to entire data, row by row,
%time gabor_features = np.apply_along_axis(apply_gabor, 1, X, (h,w))
print(gabor_features.shape)
%time dist_matrix_gabor = pairwise_distances(gabor_features)
#perform closest image with gabor filter
def closest_image_gabor(dmat_gabor, idx1):
    distances = copy.deepcopy(dmat_gabor[idx1,:])
    distances[idx1] = np.inf
    idx2 = np.argmin(distances)
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    #show the original picture in grayscale
    imshow(X[idx1].reshape((h,w)), cmap='gray')
    plt.title("Original:"+names[y[idx1]])
    plt.grid()
    plt.subplot(1,3,2)
    imshow(X[idx2].reshape((h,w)), cmap='gray')
    plt.title("Gabor Closest:"+names[y[idx2]])
    plt.grid()
    #for the third picture, show the closest picture with the gabor filter applied
    plt.subplot(1,3,3)
    imshow(gabor_features[idx2].reshape((h,w)), cmap='gray')
    plt.title("Gabor with filter:"+names[y[idx2]])
    plt.grid()
#create widget for gabor filter
widgets.interact(closest_image_gabor, idx1=(0, n_samples-1, 1),
                 dmat_gabor=widgets.fixed(dist_matrix_gabor),
                 __manual=True)
CPU times: user 2.47 s, sys: 215 ms, total: 2.69 s
Wall time: 2.39 s
(3000, 59536)
CPU times: user 31.7 s, sys: 2.56 s, total: 34.3 s
Wall time: 5.39 s
#show the first 5 images with their closest gabor image
for i in range(5):
    closest_image_gabor(dist_matrix_gabor, i)
plt.show()
Similar to the previous section on DAISY features, we analyzed the performance of our Gabor filter feature extraction using nearest-neighbour classification. Through this classification, we found its accuracy to be higher than that of PCA but not as high as that of DAISY feature extraction, with several misclassifications apparent. Visually, the Gabor filter seems much better at identifying the overall shape of animal faces, but this contributes to misclassifications when an animal's shape more closely resembles the opposite species.
Overall, we believe this method could still be useful for our prediction task, but the DAISY feature extraction method would be more useful, as it achieved higher accuracy even though it took longer to complete.